Lab Assignment One: Exploring Table Data

In [1]:
print("team member: ")
print("long zhang" + "      " +"47778605")
print("zhengwen zhang" + "  "+ "47502277")
print("qinyu chen"+ "      " + "47813055")
team member: 
long zhang      47778605
zhengwen zhang  47502277
qinyu chen      47813055

1. Business Understanding

Heart disease is the leading cause of death in the US and other developed countries, so it is worth learning which risk factors contribute to it. Since 1948, the Framingham Heart Study has been investigating heart disease; over roughly 70 years it has continuously refreshed our understanding of heart health and disease, followed three generations of participants, and published 3,698 articles. Framingham's sign proudly displays two phrases: "A small town that changed the heart of Americans" and "Framingham Heart Research House".

The "Logistic regression - To predict heart disease" dataset is publicly available from Kaggle; it was collected in an ongoing study carried out in the town of Framingham, MA. The dataset contains 4,238 rows and 16 columns, can be downloaded for free, and is ready to use immediately. It contains categorical features (e.g., male/female, current smoker or not), and the goal of classification is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD, the last column), so this dataset meets all the requirements for Lab One.

Doctors and healthcare providers would be interested in the results because they can gain a better understanding of the risk factors associated with heart disease and make recommendations to existing patients on how to reduce risks or complications. Based on our model, they can also make predictions for new patients from collected survey data. Patients, in turn, can adjust their lifestyle according to the prediction results.

Once we begin modeling, we will visualize the relationships among different features. We will also apply PCA dimensionality reduction, which greatly reduces computational complexity, reduces recognition errors caused by redundant information, and improves recognition accuracy. These techniques will be useful to third parties and other researchers.

Dataset: heart disease prediction URL: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression

Question of Interest: which patients will develop heart disease within 10 years?

2. Data Understanding

2.1 Dataset Description

In [2]:
#import the libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy

#load the data set downloaded to local directory, adding "r" for unicode error in windows OS
#chd = pd.read_csv(r"C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabOne\framingham.csv")
chd = pd.read_csv("/Users/chenqinyu/Desktop/framingham.csv")
#chd = pd.read_csv("/Users/terrence/Desktop/framingham.csv")
In [3]:
#preview of the first 5 observations
chd.head()
Out[3]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0 1
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0 0
In [4]:
#chd1= chd.copy()
In [5]:
#get the data summary
chd.describe()
Out[5]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
count 4238.000000 4238.000000 4133.000000 4238.000000 4209.000000 4185.000000 4238.000000 4238.000000 4238.000000 4188.000000 4238.000000 4238.000000 4219.000000 4237.000000 3850.000000 4238.000000
mean 0.429212 49.584946 1.978950 0.494101 9.003089 0.029630 0.005899 0.310524 0.025720 236.721585 132.352407 82.893464 25.802008 75.878924 81.966753 0.151958
std 0.495022 8.572160 1.019791 0.500024 11.920094 0.169584 0.076587 0.462763 0.158316 44.590334 22.038097 11.910850 4.080111 12.026596 23.959998 0.359023
min 0.000000 32.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 107.000000 83.500000 48.000000 15.540000 44.000000 40.000000 0.000000
25% 0.000000 42.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 206.000000 117.000000 75.000000 23.070000 68.000000 71.000000 0.000000
50% 0.000000 49.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 234.000000 128.000000 82.000000 25.400000 75.000000 78.000000 0.000000
75% 1.000000 56.000000 3.000000 1.000000 20.000000 0.000000 0.000000 1.000000 0.000000 263.000000 144.000000 89.875000 28.040000 83.000000 87.000000 0.000000
max 1.000000 70.000000 4.000000 1.000000 70.000000 1.000000 1.000000 1.000000 1.000000 696.000000 295.000000 142.500000 56.800000 143.000000 394.000000 1.000000
In [6]:
print(chd.info())
#data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
male               4238 non-null int64
age                4238 non-null int64
education          4133 non-null float64
currentSmoker      4238 non-null int64
cigsPerDay         4209 non-null float64
BPMeds             4185 non-null float64
prevalentStroke    4238 non-null int64
prevalentHyp       4238 non-null int64
diabetes           4238 non-null int64
totChol            4188 non-null float64
sysBP              4238 non-null float64
diaBP              4238 non-null float64
BMI                4219 non-null float64
heartRate          4237 non-null float64
glucose            3850 non-null float64
TenYearCHD         4238 non-null int64
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None

The spreadsheet has 4,238 rows, and this summary shows that some columns have fewer than 4,238 non-null counts, so those features contain missing values. All features are either int64 or float64, so the data are entirely numerical and there is no need to convert categorical values to numbers.
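The per-column missing counts could also be reported as percentages, which makes it easier to judge how serious each gap is. A minimal sketch, shown on a tiny stand-in frame (the notebook itself would pass `chd`):

```python
import pandas as pd
import numpy as np

def missing_report(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    pct = 100 * counts / len(df)
    report = pd.DataFrame({"missing": counts, "percent": pct.round(2)})
    return report.sort_values("missing", ascending=False)

# tiny stand-in frame with the same kinds of gaps as the real data
demo = pd.DataFrame({"glucose": [77.0, np.nan, 70.0, np.nan],
                     "age": [39, 46, 48, 61]})
print(missing_report(demo))
```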

In [7]:
# creating a data description table
chd_des = pd.DataFrame()
chd_des['features']=chd.columns
chd_des['descriptions']=['sex','age','education','current smoker or not','number of cigarettes per day','blood pressure medications','had a stroke before','had hypertension before','diabetes','total cholesterol level','systolic blood pressure','diastolic blood pressure','Body Mass Index','heart rate','glucose level','10 year risk of coronary heart disease CHD']
chd_des['scales']=['nominal']+['continuous']+['nominal']*2+['continuous']+['nominal']*4+['continuous']*6+['binary']
chd_des['range']=['1:male;0:female']+['0-100']+['1-4']+['0:No;1:Yes']+['0-70']+['0:No;1:Yes']+['0:No;1:Yes']+['0:No;1:Yes']+['0:No;1:Yes']+['0-696']+['0-295']+['0-142.5']+['0-56.8']+['0-143']+['0-394']+['0:No;1:Yes']
chd_des
Out[7]:
features descriptions scales range
0 male sex nominal 1:male;0:female
1 age age continuous 0-100
2 education education nominal 1-4
3 currentSmoker current smoker or not nominal 0:No;1:Yes
4 cigsPerDay number of cigarettes per day continuous 0-70
5 BPMeds blood pressure medications nominal 0:No;1:Yes
6 prevalentStroke had a stroke before nominal 0:No;1:Yes
7 prevalentHyp had hypertension before nominal 0:No;1:Yes
8 diabetes diabetes nominal 0:No;1:Yes
9 totChol total cholesterol level continuous 0-696
10 sysBP systolic blood pressure continuous 0-295
11 diaBP diastolic blood pressure continuous 0-142.5
12 BMI Body Mass Index continuous 0-56.8
13 heartRate heart rate continuous 0-143
14 glucose glucose level continuous 0-394
15 TenYearCHD 10 year risk of coronary heart disease CHD binary 0:No;1:Yes

Here we have information for all 16 attributes; we can clearly see the descriptions, scales, and ranges. From common sense, all of the features are important factors in heart research. Some features are directly related: for example, if a person is not a current smoker, his or her cigsPerDay value must be 0.
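The currentSmoker/cigsPerDay dependency mentioned above can be checked directly. A minimal sketch, using a toy frame in place of `chd`:

```python
import pandas as pd

def check_smoker_consistency(df):
    """Rows where a non-smoker reports a nonzero cigsPerDay (should be none)."""
    mask = (df["currentSmoker"] == 0) & (df["cigsPerDay"].fillna(0) > 0)
    return df[mask]

# toy stand-in for chd
demo = pd.DataFrame({"currentSmoker": [0, 1, 0],
                     "cigsPerDay": [0.0, 20.0, 0.0]})
print(len(check_smoker_consistency(demo)))  # 0 → the two columns agree
```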

2.2 Verification of data quality

In [8]:
import missingno as mn
mn.matrix(chd)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a211defd0>

From the visualization above, we find that six features (education, cigsPerDay, BPMeds, totChol, BMI, glucose) have missing values. Some people are self-taught, so it is hard to classify which education group they belong to. Values such as BMI can be hard or inconvenient to measure, and similar reasons may explain the other features with missing values.

We know from dataset description that there are some missing values. First of all, let's check how many features are missing for all observations.

In [9]:
print (chd.isnull().sum())
#checking missing values amount of each features in this dataset
male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

The results above show that 7 of the 16 features have missing values; for example, 105 of the 4,238 observations are missing 'education'. Next, we try to impute the missing data.

In [10]:
chd["education"].describe()
Out[10]:
count    4133.000000
mean        1.978950
std         1.019791
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         4.000000
Name: education, dtype: float64
In [11]:
chd.education.value_counts()
#according to the statistics of education and the counts of its distinct values, we could impute the 105
#missing education values with "2.0"
Out[11]:
1.0    1720
2.0    1253
3.0     687
4.0     473
Name: education, dtype: int64

Next, we do the same for the remaining features with missing values.

In [12]:
chd["cigsPerDay"].describe()
Out[12]:
count    4209.000000
mean        9.003089
std        11.920094
min         0.000000
25%         0.000000
50%         0.000000
75%        20.000000
max        70.000000
Name: cigsPerDay, dtype: float64
In [13]:
chd.cigsPerDay.value_counts()
#according to the statistics of cigsPerDay and the counts of its distinct values, we could impute
#the 29 missing cigsPerDay values with the mode "0"
Out[13]:
0.0     2144
20.0     734
30.0     217
15.0     210
10.0     143
9.0      130
5.0      121
3.0      100
40.0      80
1.0       67
43.0      56
25.0      55
35.0      22
6.0       18
2.0       18
7.0       12
8.0       11
60.0      11
4.0        9
18.0       8
17.0       7
50.0       6
23.0       6
11.0       5
45.0       3
16.0       3
13.0       3
12.0       3
14.0       2
19.0       2
70.0       1
38.0       1
29.0       1
Name: cigsPerDay, dtype: int64
In [14]:
chd.BPMeds.value_counts()
#according to the counts of the distinct values of BPMeds, most people are "0",
#i.e., most people take no blood pressure medication, so we impute the 53 missing BPMeds values with the mode "0"
Out[14]:
0.0    4061
1.0     124
Name: BPMeds, dtype: int64
In [15]:
chd["totChol"].describe()
#according to the statistics of total cholesterol, most people fall within the range 200-300,
#so we impute the 50 missing totChol values with the mean "236.721585"
Out[15]:
count    4188.000000
mean      236.721585
std        44.590334
min       107.000000
25%       206.000000
50%       234.000000
75%       263.000000
max       696.000000
Name: totChol, dtype: float64
In [16]:
chd["BMI"].describe()
#according to the statistics of BMI, most people fall within the range 20-30,
#so we impute the 19 missing BMI values with the mean "25.802008"
Out[16]:
count    4219.000000
mean       25.802008
std         4.080111
min        15.540000
25%        23.070000
50%        25.400000
75%        28.040000
max        56.800000
Name: BMI, dtype: float64
In [17]:
chd["heartRate"].describe()
#according to the statistics of heartRate, most people fall within the range 70-80,
#so we impute the 1 missing heartRate value with the mean "75.878924"
Out[17]:
count    4237.000000
mean       75.878924
std        12.026596
min        44.000000
25%        68.000000
50%        75.000000
75%        83.000000
max       143.000000
Name: heartRate, dtype: float64
In [18]:
chd["glucose"].describe()
#according to the statistics of glucose, most people fall within the range 70-80,
#so we impute the 388 missing glucose values with the mean "81.966753"
Out[18]:
count    3850.000000
mean       81.966753
std        23.959998
min        40.000000
25%        71.000000
50%        78.000000
75%        87.000000
max       394.000000
Name: glucose, dtype: float64

2.3 Imputation

Replace the missing values with the above-mentioned values and keep each column in its original data type (float).

In [19]:
import copy
chd1= copy.deepcopy(chd)
In [20]:
#imputing null education with the 50% (median) value 2.0
chd["education"].fillna(2.0, inplace = True)
chd['education'] = chd['education'].astype(float)
In [21]:
chd.education.plot(kind='hist',alpha=0.5)
chd1.education.plot(kind='hist',alpha=0.5)
plt.show()
In [22]:
#imputing null cigsPerDay with the mode '0'
chd["cigsPerDay"].fillna(0, inplace = True)
chd['cigsPerDay'] = chd['cigsPerDay'].astype(float)
In [23]:
#imputing null BPMeds with mode '0'
chd["BPMeds"].fillna(0, inplace = True)
chd['BPMeds'] = chd['BPMeds'].astype(float)
In [24]:
#imputing null totChol values with the mean value of 236.721585
chd["totChol"].fillna(236.721585, inplace = True)
chd['totChol'] = chd['totChol'].astype(float)
In [25]:
#imputing null BMI value to be the mean of 25.802008
chd["BMI"].fillna(25.802008, inplace = True)
chd['BMI'] = chd['BMI'].astype(float)
In [26]:
#imputing null heart rate value to be the mean of 75.878924
chd["heartRate"].fillna(75.878924, inplace = True)
chd['heartRate'] = chd['heartRate'].astype(float)
In [27]:
#imputing null glucose value to be the mean of 81.966753
chd["glucose"].fillna(81.966753, inplace = True)
chd['glucose'] = chd['glucose'].astype(float)
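The seven per-column fills above could equivalently be written as one short function: median for education, mode for cigsPerDay and BPMeds, and the column mean for the remaining continuous features. A sketch on a tiny stand-in frame (the notebook itself would pass `chd`):

```python
import pandas as pd

def impute_like_above(df):
    """Mirror of the per-column fills: median / mode / mean by feature type."""
    out = df.copy()
    out["education"] = out["education"].fillna(out["education"].median())
    for col in ["cigsPerDay", "BPMeds"]:
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in ["totChol", "BMI", "heartRate", "glucose"]:
        out[col] = out[col].fillna(out[col].mean())
    return out

demo = pd.DataFrame({"education": [1.0, 2.0, None, 2.0],
                     "cigsPerDay": [0.0, None, 20.0, 0.0],
                     "BPMeds": [0.0, 0.0, None, 0.0],
                     "totChol": [200.0, None, 240.0, 220.0],
                     "BMI": [25.0, 26.0, None, 24.0],
                     "heartRate": [75.0, 80.0, 70.0, None],
                     "glucose": [None, 80.0, 90.0, 85.0]})
filled = impute_like_above(demo)
print(filled.isnull().sum().sum())  # 0 after imputation
```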
In [28]:
#preview the table after imputation
chd
Out[28]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.000000 0
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.000000 0
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.000000 0
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.000000 1
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.000000 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4233 1 50 1.0 1 1.0 0.0 0 1 0 313.0 179.0 92.0 25.97 66.0 86.000000 1
4234 1 51 3.0 1 43.0 0.0 0 0 0 207.0 126.5 80.0 19.71 65.0 68.000000 0
4235 0 48 2.0 1 20.0 0.0 0 0 0 248.0 131.0 72.0 22.00 84.0 86.000000 0
4236 0 44 1.0 1 15.0 0.0 0 0 0 210.0 126.5 87.0 19.16 86.0 81.966753 0
4237 0 52 2.0 0 0.0 0.0 0 0 0 269.0 133.5 83.0 21.47 80.0 107.000000 0

4238 rows × 16 columns

In [29]:
#this is to confirm that the imputation was successful for all features
print(chd.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
male               4238 non-null int64
age                4238 non-null int64
education          4238 non-null float64
currentSmoker      4238 non-null int64
cigsPerDay         4238 non-null float64
BPMeds             4238 non-null float64
prevalentStroke    4238 non-null int64
prevalentHyp       4238 non-null int64
diabetes           4238 non-null int64
totChol            4238 non-null float64
sysBP              4238 non-null float64
diaBP              4238 non-null float64
BMI                4238 non-null float64
heartRate          4238 non-null float64
glucose            4238 non-null float64
TenYearCHD         4238 non-null int64
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None
In [30]:
#this is another confirmation that the imputation was successful for all features
print (chd.isnull().sum())
male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64
In [31]:
chd1 = chd.copy()
In [32]:
chd1.head()
Out[32]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0 1
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0 0
In [33]:
#check if there is any duplicates
idx=chd.duplicated()
len(chd[idx])
Out[33]:
0

No duplicates have been found.

In [34]:
chd2 = chd.iloc[:,0:15]
#create a dataframe that only has features, with no target in it
In [35]:
chd2.head()
Out[35]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0

3. Data Visualization

3.1 plot for coronary heart disease in different ages

In [36]:
# Lets aggregate by age and count chd count
chd_grouped_age = chd.groupby('age').TenYearCHD.sum()
chd_grouped_age.plot(kind='bar')
plt.xlabel('Age')
plt.ylabel('counts of chd')
plt.title('relationship between age and chd')
plt.show()

We plot the CHD counts versus age; the counts suggest that the odds of having CHD increase with age, although raw counts also reflect how many people of each age are in the sample.
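Because raw counts mix risk with group size, a complementary view is the CHD rate per age, i.e. the mean of the 0/1 target within each age group. A minimal sketch with toy data in place of `chd`:

```python
import pandas as pd

def chd_rate_by_age(df):
    """Mean of the 0/1 target per age = fraction of that age group with CHD."""
    return df.groupby("age")["TenYearCHD"].mean()

# toy stand-in for chd: three 40-year-olds (one case), two 60-year-olds (both cases)
demo = pd.DataFrame({"age": [40, 40, 40, 60, 60],
                     "TenYearCHD": [0, 0, 1, 1, 1]})
print(chd_rate_by_age(demo))
```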

In [37]:
# Lets aggregate by cigsPerDay and count chd count
chd_grouped_age = chd.groupby('cigsPerDay').TenYearCHD.sum()
chd_grouped_age.plot(kind='bar')
plt.xlabel('cigsPerDay')
plt.ylabel('counts of chd')
plt.title('relationship between cigsPerDay and chd')
plt.show()

3.2 Bar plot for coronary heart disease in different genders

In [38]:
tenyears_chd = pd.crosstab(chd['TenYearCHD'], chd['male'])
print(tenyears_chd)
tenyears_chd.plot(kind='bar', stacked=True)
plt.xlabel("ten years coronary heart disease")
plt.ylabel("Number of Counts")
plt.title('Distribution of male')
plt.show() 
male           0     1
TenYearCHD            
0           2118  1476
1            301   343

As we can see from the bar plot above, among those who will not have heart disease within ten years, the portion of females is higher than that of males; among those who will, the portion of males is slightly higher than that of females.
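The gender proportions described above can be computed directly by normalizing the crosstab within each gender column, rather than reading them off the stacked bars. A small sketch on toy data standing in for `chd`:

```python
import pandas as pd

# toy stand-in: 4 females (1 case), 4 males (2 cases)
demo = pd.DataFrame({"male": [0] * 4 + [1] * 4,
                     "TenYearCHD": [0, 0, 0, 1, 0, 0, 1, 1]})

# normalize="columns" gives the CHD proportion within each gender
props = pd.crosstab(demo["TenYearCHD"], demo["male"], normalize="columns")
print(props)
```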

3.3 Violinplot for systolic blood pressure for different genders

In [39]:
sns.violinplot("male", "sysBP", data=chd,
               palette=["pink", "gray"]);

From the plot above, we can tell that males' systolic blood pressure is slightly higher than females', and the shape of the males' violin plot is more compressed and flat, which means males' systolic blood pressure is more concentrated around the median (around 125). The range between the upper and lower adjacent values for males is also slightly smaller than for females.

3.4 Frequency plot of BMI per day

In [40]:
sns.distplot(chd['BMI'])
plt.xlabel(' BMI')
plt.ylabel('Frequency')
plt.title('Frequency of BMI')
Out[40]:
Text(0.5, 1.0, 'Frequency of BMI')

As we can tell from the plot above, the frequency distribution of BMI looks very much like a Gaussian distribution. A BMI of 25 is the most frequent value, with 27 and 26 the second and third most frequent values.

3.5 Bar chart for level of education count

In [41]:
colors = ['blue', 'purple', 'green', 'yellow']
chd['education'].value_counts().plot.bar(color = colors)
plt.xlabel('education')
plt.ylabel('number of counts')
plt.title('level of education count of sample')
Out[41]:
Text(0.5, 1.0, 'level of education count of sample')

According to the bar plot of education-level counts, most people in this sample are at education level "1.0", the lowest level. The dataset follows a pattern: the higher the education level, the smaller the corresponding count.

3.6 Boxplot for heart rate in different education level

In [64]:
with sns.axes_style(style='ticks'):
    g = sns.factorplot("education", "heartRate", "male", data=chd, kind="box")
    g.set_axis_labels("education", "heartRate");

According to the plot above, an interesting observation is that the median heart rate of females is always equal to or slightly higher than that of males at every education level. The median heart rate of males is about the same at education levels 1.0, 2.0, and 3.0, but noticeably lower at level 4.0. The upper quartiles of the female box plots at education levels 1.0 and 2.0 are higher than those at levels 3.0 and 4.0.

Q1: What is the relationship between the age_range and the prevalent Hypertensive?

First, define the age_range bins for age.

In [43]:
chd['age_range'] = pd.cut(chd.age, [30,40,50,60,70], labels=['30yo-40yo','40yo-50yo','50yo-60yo','60yo-70yo'])
In [44]:
chd.head()
Out[44]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD age_range
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0 30yo-40yo
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0 40yo-50yo
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0 40yo-50yo
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0 1 60yo-70yo
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0 0 40yo-50yo
In [45]:
Hyp_chd = pd.crosstab(chd['prevalentHyp'], chd['age_range'])
print(Hyp_chd)
Hyp_chd.plot(kind='bar', stacked=True)
plt.xlabel("prevalent hypertensive")
plt.ylabel("Number of Counts")
plt.title('Distribution of age_range')
plt.show() 
age_range     30yo-40yo  40yo-50yo  50yo-60yo  60yo-70yo
prevalentHyp                                            
0                   656       1225        777        264
1                    90        384        527        315

From the bar plot above, among those who have prevalent hypertension, the largest age_range portion is 50yo-60yo, followed by 40yo-50yo and 60yo-70yo, with very little from 30yo-40yo. Among those without prevalent hypertension, the largest portion is 40yo-50yo, followed by 50yo-60yo and 30yo-40yo, with little from 60yo-70yo.

Q2: what is the relationship among diabetes, heart rate, and CHD (coronary heart disease)?

(We used chd['heartRate'].describe() to choose the cut points for the heart-rate bins.)
In [46]:
chd['heartRate_range'] = pd.cut(chd.heartRate, [40,60,100,150,
                                             ], labels=['low','normal','high'])
In [47]:
chd.head()
Out[47]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD age_range heartRate_range
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0 30yo-40yo normal
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0 40yo-50yo normal
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0 40yo-50yo normal
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0 1 60yo-70yo normal
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0 0 40yo-50yo normal
In [48]:
sns.violinplot(x="heartRate_range", y="TenYearCHD", hue="diabetes", data=chd, 
               split=True, inner="quart")
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a21cd7390>

To answer this question, we can use a violin plot. According to the plot above, we can draw the following conclusions: 1) From the heart-rate perspective: the lower the heart rate, the less likely people are to have CHD within ten years, and the less likely they are to have diabetes. 2) From the diabetes perspective: people with diabetes tend to have normal or high heart rates, and their chance of getting CHD is also higher. 3) From the CHD perspective: people with CHD are more likely to have diabetes and a high heart rate.

Q3: what are the correlations among these features (predictors), and how strongly are those variables inter-related?

In [49]:
# set the plotting style
cmap = sns.set(style="darkgrid")

# set the figure size
f, ax = plt.subplots(figsize=(10, 10))

# exclude the NaN/null or object datatypes and plot the correlation map
sns.heatmap(chd2.corr(), cmap=cmap, annot=True)

f.tight_layout()
plt.title('Correlation Heatmap')
Out[49]:
Text(0.5, 1, 'Correlation Heatmap')

To examine the correlations between features, we can use a correlation heatmap. From the heatmap and its legend above, the lighter a cell, the more positively related the two features are; the darker, the more negatively related.
1) From the hypertension perspective: systolic and diastolic blood pressure are strongly positively related, and both are positively related to hypertension. 2) From the diabetes perspective: glucose is strongly positively related to diabetes. 3) From the age perspective: age is negatively related to education level and cigarettes per day, which means that in this sample, younger people tend to have more education and smoke fewer cigarettes per day.
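To rank the relationships the heatmap shows, the strongest absolute pairwise correlations can also be extracted programmatically. A sketch with a tiny stand-in for `chd2`:

```python
import pandas as pd
import numpy as np

def top_corr_pairs(df, n=3):
    """Strongest absolute pairwise correlations, each pair listed once."""
    corr = df.corr().abs()
    # keep the upper triangle only so each pair appears once (no self-pairs)
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().sort_values(ascending=False).head(n)

# tiny stand-in using the first five rows' blood-pressure and age values
demo = pd.DataFrame({"sysBP": [106.0, 121.0, 127.5, 150.0, 130.0],
                     "diaBP": [70.0, 81.0, 80.0, 95.0, 84.0],
                     "age": [39.0, 46.0, 48.0, 61.0, 46.0]})
print(top_corr_pairs(demo))
```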

In [50]:
import numpy as np
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
In [51]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(10,7)})
In [52]:
chd.head()
Out[52]:
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD age_range heartRate_range
0 1 39 4.0 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0 30yo-40yo normal
1 0 46 2.0 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0 40yo-50yo normal
2 1 48 1.0 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0 40yo-50yo normal
3 0 61 3.0 1 30.0 0.0 0 1 0 225.0 150.0 95.0 28.58 65.0 103.0 1 60yo-70yo normal
4 0 46 3.0 1 23.0 0.0 0 0 0 285.0 130.0 84.0 23.10 85.0 85.0 0 40yo-50yo normal
In [53]:
#sns.pairplot(chd, hue='education')

4. Exceptional work

4.1 PCA and Umap

PCA is a common data-analysis method, often used for dimensionality reduction of high-dimensional data; it can be used to extract the main feature components of the data.

In [54]:
#import the dataset
df = chd1
In [55]:
#our target is TenYearCHD 
df['target'] = df.TenYearCHD.astype(np.int)
# Delete the column of target from our table
df = df.drop("TenYearCHD",axis=1)
In [56]:
# because different features have different units, we have to standardize the data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.values
X = StandardScaler().fit_transform(X)
y = df.target


pca = PCA(n_components=2)
pca.fit(X) # fit data and then transform it
X_pca = pca.transform(X)

# print the components

print ('pca:', pca.components_)
pca: [[-0.03542132  0.29993111 -0.11017367 -0.18122454 -0.14938741  0.19895738
   0.06758217  0.42884356  0.13767827  0.18572849  0.47933044  0.43036232
   0.27909183  0.12102497  0.14991992  0.17319423]
 [ 0.36068553 -0.1013683  -0.01982339  0.59347404  0.63517134  0.02931924
  -0.0127272   0.13297555  0.00325013  0.00973986  0.12343901  0.16058722
   0.02816935  0.13613748 -0.00331793  0.15623849]]

Here we get the principal-component weights that PCA uses for its projection.

In [57]:
import seaborn as sns
cmap = sns.set(style="darkgrid") 

# the code is mentioned in class
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx,f in enumerate(names):
            if fidx>0 and comp[fidx]>=0:
                tmp_string+='+'
            tmp_string += '%.2f*%s ' % (comp[fidx],f[:-5])
        tmp_array.append(tmp_string)
    return tmp_array
  
plt.style.use('default')
# Data Analytics
pca_weight_strings = get_feature_names_from_weights(pca.components_, df.columns) 

# create some pandas dataframes from the transformed outputs
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])

from matplotlib.pyplot import scatter

# scatter plot the output, with the names created from the weights
ax = scatter(X_pca[:,0], X_pca[:,1], c=y, s=(y+2)*10, cmap=cmap)
plt.xlabel(pca_weight_strings[0]) 
plt.ylabel(pca_weight_strings[1])
#plt.figure(figsize=(10,20))
Out[57]:
Text(0, 0.5, '0.36* -0.10* -0.02*educ +0.59*currentS +0.64*cigsP +0.03*B -0.01*prevalentS +0.13*prevale +0.00*dia +0.01*to +0.12* +0.16* +0.03* +0.14*hear -0.00*gl +0.16*t ')

By using PCA, the data dimension is reduced from 16 to 2. From the plot, the dataset divides into two clusters, although not very cleanly. We can conclude that TenYearCHD (our target) is related to many features, some strongly and some weakly. We hope the projected values are spread out as much as possible, because if samples overlap, some samples effectively disappear. This can also be understood from the perspective of entropy: the larger the entropy, the more information is retained. To pursue this goal, we continue our work.

In [58]:
# code also comes from class
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, layout,XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=layout.XAxis(title='Principal components'), 
                         yaxis=layout.YAxis(title='Explained variance ratio'))
    })
        
pca = PCA(n_components=4)
X_pca = pca.fit(X)
plot_explained_variance(pca)

From the plot above, the largest principal component explains only about 0.2 of the variance, and the sum for the largest two components is less than 0.4, which is not very good. To get a clearer representation, we need a more powerful method. After searching, we found that UMAP is a good option: it can handle large, high-dimensional datasets effortlessly, combines visualization with dimensionality reduction, and preserves the global structure of the data in addition to the local structure. UMAP maps nearby points on the manifold to nearby points in the low-dimensional representation, and does the same for distant points. Therefore, we use UMAP here.
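The explained-variance check above can also be done without plotly by cumulating `explained_variance_ratio_` directly. A sketch on synthetic standardized data in place of the real `X`:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic standardized data as a stand-in for the X matrix used above
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

pca = PCA().fit(X_demo)                        # keep all components
cum = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components explaining at least 90% of the variance
n_needed = int(np.argmax(cum >= 0.90)) + 1
print(n_needed, cum.round(3))
```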

citation for umap

In [59]:
import umap
from sklearn.datasets import load_digits
import warnings
warnings.filterwarnings('ignore')
In [60]:
# select the 'target' column (TenYearCHD, renamed above) as data target
data_target = df.iloc[:,15:]
In [61]:
#standardization
data_mean = data_target.mean()
data_std = data_target.std()
data_fea = (data_target - data_mean)/data_std
In [62]:
# here we use the code which comes from citation
umap_data = umap.UMAP(n_neighbors=5, min_dist=0.8, n_components=3).fit_transform(data_fea.values)
umap_data.shape
Out[62]:
(4238, 3)

It shows that there are 4,238 samples and three embedding columns (instead of the 16 features we started with), so UMAP has reduced the data to 3D.

In [63]:
plt.figure(figsize=(10,8))
plt.scatter(umap_data[:,0], umap_data[:,1])
plt.scatter(umap_data[:,1], umap_data[:,2])
plt.scatter(umap_data[:,2], umap_data[:,0])
Out[63]:
<matplotlib.collections.PathCollection at 0x1a2f379d10>

From the plot above, the dataset is clustered into about five groups. The dimensionality has been reduced, and we can examine the different transformed components. The correlation between the transformed variables is very small; compared with the components obtained from PCA, the UMAP components are far less correlated. Therefore, UMAP tends to provide better results here.

In [ ]: